We consider the optimal use of information in shooting at a collection of targets, generally with the object of maximizing

Introduction

The subject of this paper is the manner in which information that is gradually acquired about the status of a target ought to influence the process of shooting at it. It is a military subject that has become increasingly important with the advent of long-range, accurate, but expensive, weapons. For example, suppose that the single-shot probability of killing a target is 0.9, but that the target is so important that even a 0.1 survival probability is not acceptable. One could shoot independently twice at the target, thereby achieving a kill probability of 0.99 at the expense of two shots. Alternatively, one could shoot at the target, look to see if it has been killed, and then shoot again, if necessary. The kill probability is still 0.99 with this shoot-look-shoot policy, but the average expenditure of shots is only 1.1, just over half of the two-shot expenditure. The payoff for acquiring and using information optimally can be significant.
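The arithmetic of this opening example can be checked with a short sketch. The function names are illustrative only and assume a single target, independent shots, and a perfect look.

```python
def two_shot_salvo(p):
    """Fire two independent shots without looking: (kill probability, shots used)."""
    return 1 - (1 - p) ** 2, 2.0


def shoot_look_shoot(p):
    """Fire once, look, and fire a second shot only if the target survived."""
    kill_probability = 1 - (1 - p) ** 2
    expected_shots = 1 + (1 - p)      # the second shot is needed with probability 1 - p
    return kill_probability, expected_shots


salvo = two_shot_salvo(0.9)
sls = shoot_look_shoot(0.9)
```

Both policies give the same kill probability; only the expected expenditure of shots differs.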

Use  of  the  term  “shoot-look-shoot”  sometimes  implies 
that  only  a  single  look  is  contemplated,  but  not  here.  In 
general,  we  will  consider  multiple  error-prone  looks  and 
shots  in  stages.  We  will  consistently  use  the  term  “kill” 
because  of  the  problem’s  military  heritage,  but  terms  such 
as  “damage,”  “identify,”  or  “find”  could  also  be  substituted. 
The  essential  requirements  are  that  the  “target”  be  in  one 
of  two  states,  one  of  which  is  desirable  and  the  other  not, 
and  that  the  marksman  have  a  succession  of  opportunities 
for  altering  and  discovering  the  target’s  state. 

The general problem considered here involves a matrix $P = (P_{ij})$, where $P_{ij}$ is the probability that a shot of type $i$ is effective against target $j$. All attempts to kill a target are assumed to be independent, as in the two-shot example above. There may also be data associated with the discovery process. Because the targets are not always identical, it is necessary to attribute a weight or value $v_j$ to target $j$, and the firing problem usually takes the form of an optimization where the object is to maximize the expected value of the total target value killed.

The  expensiveness  of  shots  can  take  three  forms:  one 
(§1)  where  the  number  of  shots  of  type  i  is  constrained 
to  not  exceed  some  given  level,  one  (§2)  where  shots  are 
available  in  unlimited  quantities,  as  long  as  they  are  paid 
for,  and  a  third  form  (§3)  where  shots  are  constrained 
as  well  as  costly.  The  time  at  which  a  target  is  killed  is 
usually  irrelevant;  §4  is  an  exception  where  rewards  are 
discounted. 

1.  The  Constrained  Case 

We  assume  in  this  section  that  the  problem  is  to  assign  a 
given  set  of  shots  to  a  given  set  of  targets. 

1.1.  Perfect  Information,  Infinite  Time  Horizon 

Assume that the results of each shot are revealed immediately after each shot is made, and that shots can be made one at a time because of the infinite time horizon. The import of these assumptions is that every shot is made with exact knowledge of the state of the target set. We take the state of the firing process to be $(S, T)$, where $S$ and $T$ are the sets of remaining shots and live targets, respectively. The object is to find $V(S, T)$, the largest amount of target value that can be killed with all remaining shots, on the average, together with the shot $i \in S$ that should be taken against target $j \in T$. If either $S$ or $T$ is empty, then of course $V(S, T) = 0$.

$V(S, T)$ can be found by a dynamic programming recursion. Because there are only two possibilities for each shot, and because all shots are independent, by the conditional expectation theorem we have

$$V(S, T) = \max_{i \in S,\, j \in T} \left\{ P_{ij}\,\bigl(v_j + V(S - i, T - j)\bigr) + (1 - P_{ij})\, V(S - i, T) \right\}. \tag{1}$$

If $S$ has $m$ shots left in it, and if $T$ has more than $m$ targets, then (1) will have to be evaluated $2^m$ times in the process of computing $V(S, T)$. Computational difficulty can be expected as $m$ becomes large.

If $v_j = v$ and $P_{ij} = p$ for all $i$ and $j$, then (1) simplifies considerably because the sets $S$ and $T$ merely need to be counted. Let $s$ and $t$ be the numbers of shots and targets, and let $X$ be a binomial random variable with parameters $s$ and $p$. $X$ can be interpreted as the number of effective shots in the set $S$. The number of kills will be $X$ unless the marksman runs out of targets. Therefore $V(S, T) = E(\min(X, t))$, a relatively simple computation. Anderson (1989) notes that the same formula applies as long as $P_{ij}$ does not depend on $j$; $X$ still has the same interpretation, although it is no longer binomial. See also Przemieniecki (1990).
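The identical-shots formula can be evaluated directly from the binomial distribution. The sketch below assumes unit target values, so that the result counts kills; the function name is hypothetical.

```python
from math import comb


def expected_kills(s, t, p):
    """E[min(X, t)] for X ~ Binomial(s, p): expected kills with s shots and t targets.

    Assumes every target has value 1 and every shot has kill probability p.
    """
    return sum(
        min(k, t) * comb(s, k) * p**k * (1 - p) ** (s - k)
        for k in range(s + 1)
    )
```

With two shots of kill probability 0.5 and a single target, the result is the probability that at least one shot is effective.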

Because (1) is computationally challenging for large problems, we next develop some bounds on $V(S, T)$.

A simple lower bound $\underline{V}(S, T)$ can be constructed by computing the optimal pair $(i^*, j^*) = \arg\max_{i \in S,\, j \in T} v_j P_{ij}$. This is the “myopic” firing policy: every shot is taken without regard to future consequences. Equation (1) (with $(i^*, j^*)$ substituted for $(i, j)$ and $\underline{V}(\cdot)$ replacing $V(\cdot)$ on both sides) must still be employed to evaluate $\underline{V}(S, T)$, so exact evaluation is still difficult as $m$ becomes large. However, the myopic policy is trivial to implement because knowledge of $\underline{V}(S, T)$ is not needed.

The myopic policy is not always optimal. For a counterexample, consider two shots and two targets, with

$$P = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 0 \end{bmatrix}$$

and $v = (1, 1)$. The myopic policy will assign shot 1 to target 1, after which shot 2 is useless, and the total score is 1. The optimal policy is to assign shot 2 to target 1, and then shot 1 to target 1 in case of failure, or shot 1 to target 2 in case of success. The optimized total score is $0.9(1 + 0.9) + 0.1(0 + 1) = 1.81$, a substantial improvement over the myopic score.
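Recursion (1) and the myopic rule can be sketched for this counterexample as follows. The matrix entries are inferred from the narrative (shot 1 is certain against target 1, shot 2 is useless against target 2, and the two remaining entries are 0.9), so they are an assumption rather than a quotation; shots and targets are indexed from 0 in the code, and the function names are hypothetical.

```python
from functools import lru_cache

# Assumed kill probabilities: P[i][j] for shot i against target j (0-indexed).
P = ((1.0, 0.9),
     (0.9, 0.0))
v = (1.0, 1.0)  # target values


def after_shot(i, j, S, T, value):
    """Expected value of firing shot i at target j, then continuing with `value`."""
    p = P[i][j]
    return p * (v[j] + value(S - {i}, T - {j})) + (1 - p) * value(S - {i}, T)


@lru_cache(maxsize=None)
def optimal(S, T):
    """V(S, T) from recursion (1): best expected value over all shot/target pairs."""
    if not S or not T:
        return 0.0
    return max(after_shot(i, j, S, T, optimal) for i in S for j in T)


@lru_cache(maxsize=None)
def myopic(S, T):
    """Expected value when each shot maximizes the immediate gain v[j] * P[i][j]."""
    if not S or not T:
        return 0.0
    pairs = [(i, j) for i in sorted(S) for j in sorted(T)]
    i, j = max(pairs, key=lambda ij: v[ij[1]] * P[ij[0]][ij[1]])
    return after_shot(i, j, S, T, myopic)


shots = frozenset({0, 1})
targets = frozenset({0, 1})
optimal_score = optimal(shots, targets)
myopic_score = myopic(shots, targets)
```

Memoizing on frozensets keeps the sketch short, but the number of states still grows exponentially with the number of shots, as noted above.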

The reader may wish to download an Excel workbook SLS.xls from http://diana.gl.nps.navy.mil/^washburn/. On sheet “Optimal,” SLS.xls computes solutions for small problems according to the method described above.

$V(S, T)$ can also be usefully bounded from above. See §1.3 below.

Although the time horizon in this section has been assumed to be infinite, it should be obvious that enough time to make $m$ shots is all that is required. Even smaller amounts of time may suffice. Call a shot “unitary” if it has a zero kill probability against all targets but one. Because such shots have no flexibility, they can all be assigned at once without jeopardizing the maximum score $V(S, T)$. There may also be less obvious opportunities for shortening the time horizon. Let $H(S, T)$ be the shortest horizon within which there still exists a shooting policy that will guarantee $V(S, T)$. Computation of $H(S, T)$ is itself an interesting topic, but we will not pursue it further here.

1.2. Perfect Information, Finite Horizon, Identical Shots and Targets

The analysis in §1.1 is comparatively simple because every shot is taken in perfect knowledge of the status of the target set. This is no longer true if either time is constrained or if information is imperfect. One reason for time to be constrained is that the “targets” might actually be incoming missiles, in which case the problem is one of self-defense. The general problem appears to be difficult, although various specializations such as those in this section can still be solved. We assume that all shots and targets are identical, so that the sets $S$ and $T$ can be replaced by simple counts of the number of shots and targets remaining.

Suppose that $P_{ij} = 1 - q$ for all $i$, $j$, where $q$ is the miss probability for all shots, and that $v_j = 1$ for all $j$. Because all targets are identical, each salvo should treat the remaining targets as evenly as possible, and the firing question reduces to determining how many shots to spend in each salvo. Let random variable $X$ be the number of targets killed out of $t$ when $x$ shots are allocated, and let $F_n(s, t)$ be the maximum expected number that can be killed with $s$ shots, $t$ targets, and $n$ salvos remaining. Then $F_0(s, t) = 0$, and for $n > 0$ we have the dynamic programming recursion

$$F_n(s, t) = \max_{0 \le x \le s} E\bigl(X + F_{n-1}(s - x, t - X)\bigr). \tag{2}$$

Calculating the expected value in (2) requires a distribution for $X$. The distribution is binomial if $x$ is less than $t$ or an integer multiple of $t$, or otherwise the convolution of two binomial distributions. In either case, the numerical solution of (2) is not difficult as long as $s$ and $t$ are not too large.
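A minimal sketch of the salvo recursion follows, assuming that a salvo of x shots is spread as evenly as possible over the live targets, so that some targets receive one more shot than the others. The helper names are hypothetical.

```python
from functools import lru_cache
from math import comb


def binom_pmf(n, p):
    """Probability mass function of Binomial(n, p) as a list indexed by k."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def kill_distribution(x, t, q):
    """pmf of the number of kills when x shots are spread evenly over t live targets."""
    a, b = divmod(x, t)                      # b targets get a + 1 shots, t - b get a shots
    heavy = binom_pmf(b, 1 - q ** (a + 1))
    light = binom_pmf(t - b, 1 - q ** a)
    pmf = [0.0] * (t + 1)
    for i, ph in enumerate(heavy):           # convolution of the two binomials
        for j, pl in enumerate(light):
            pmf[i + j] += ph * pl
    return pmf


def make_F(q):
    """Return F(n, s, t): max expected kills with n salvos, s shots, t targets left."""
    @lru_cache(maxsize=None)
    def F(n, s, t):
        if n == 0 or s == 0 or t == 0:
            return 0.0
        best = 0.0
        for x in range(s + 1):               # number of shots spent in this salvo
            pmf = kill_distribution(x, t, q)
            value = sum(pk * (k + F(n - 1, s - x, t - k)) for k, pk in enumerate(pmf))
            best = max(best, value)
        return best
    return F


F = make_F(0.5)
one_salvo = F(1, 3, 2)    # all three shots must be fired blind
two_salvos = F(2, 3, 2)   # one look between salvos is available
```

With three shots, two targets, and a miss probability of 0.5, the extra salvo raises the expected number of kills because the last shot is never wasted on a dead target.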

Suppose that the “targets” are actually missiles that are attacking home base, with the shots being defending interceptors. The number of salvos is limited because of the speed of the attackers. Let $p$ be the probability that any given attacker will kill home base if not intercepted, and let $G_n(s, t)$ be the largest home base survival probability when there are $s$ shots, $t$ targets, and $n$ salvos remaining. Then $G_0(s, t) = (1 - p)^t$, and for $n > 0$ we have

$$G_n(s, t) = \max_{0 \le x \le s} E\bigl(G_{n-1}(s - x, t - X)\bigr), \tag{3}$$

where $X$ is again the number of targets killed by the $x$ shots of the salvo.

Computational issues are the same as with (2). Eckler and Burr (1972) allude to (3) and give some results for the case $n = 2$. Wilkening (1999) considers the case where $n = 2$ and $p = 1$ in the context of ballistic missile defense. A reader who searches the Internet for “shoot-look-shoot” will discover that this scenario is frequently referred to.

The aforementioned workbook SLS.xls includes an implementation of (3) on sheet “DynProg” for a three-stage problem ($n = 3$). The optimal policy with only one stage remaining is to use all remaining interceptors, but with more stages the optimal policy uses fewer interceptors in order to avoid wasting them on dead targets. Of course, at least one shot at every stage is always made at every surviving target, as long as sufficient shots remain to do so. The spreadsheet also permits a minor generalization: The interceptor kill probability can depend on the number of stages remaining.

Glazebrook and Washburn: Shoot-Look-Shoot: A Review and Extension
Operations Research 52(3), pp. 454-463, ©2004 INFORMS

1.3.  Imperfect  Information 

It is possible that when information is imperfect, shots will be made against targets that are already dead. The probability that shot $i$ is effective against target $j$ remains $P_{ij}$, but an effective shot will kill its target only if the target is alive when the shot is taken.

We first develop an upper bound on the best target value killed that is valid regardless of the information available to the marksman. Fixed sets of shots and targets are, as usual, given. Define a collection of indicator random variables that can be associated with any firing policy that assigns shots to individual targets:

$Y_{ij} = 1$ if target $j$ is killed by shot $i$ (for each $j$, at most one of these can be 1),

$Y_j = 1$ if target $j$ is killed, and

$X_{ij} = 1$ if shot $i$ is assigned to target $j$ (for each $i$, at most one of these can be 1).

The firing policy will induce many correlations between these random variables, so independence assumptions among them are not appropriate, but the collection is still useful for formulating the problem of finding the optimal policy. We first note that $Y_j = \sum_i Y_{ij}$, and that the total value killed is $Z = \sum_j v_j Y_j$. The problem (call it P1) of finding the optimal policy can therefore be posed as maximizing $z$, the expected value of $Z$, subject to the following constraints:

(a) $\sum_j X_{ij} \le 1$ for all $i$ with certainty;

(b) $Y_j = \sum_i Y_{ij}$ for all $j$ with certainty;

(c) $Y_j \le 1$ for all $j$ with certainty; and

(d) other constraints.

The other constraints in P1 include the crucial relationship between $X_{ij}$ and $Y_{ij}$, the essence of the firing policy. For example, the (probably foolish) policy of ignoring all information and simply making $X_{i1} = 1$ for all $i$ would probably kill target 1, but would also result in $Y_j = 0$ for $j > 1$. There are potentially an astronomical number of firing policies, because the decision about what to do next can depend in many ways on the information available. Nonetheless, regardless of the policy employed, we assume that $E(Y_{ij}) \le P_{ij} E(X_{ij})$, with strict inequality being possible on account of the possibility that target $j$ is already dead when weapon $i$ attacks it. Now, using lowercase letters for expected values of random variables, we can construct a relaxation of P1 that we name P2:

maximize $z = \sum_j v_j y_j$

subject to

(a) $\sum_j x_{ij} \le 1$ for all $i$;

(b) $y_j \le \sum_i x_{ij} P_{ij}$ for all $j$;

(c) $y_j \le 1$ for all $j$; and

(d) all variables nonnegative.

P2 is a relaxation of P1 because a relationship that is true with certainty will also be true on the average, because sums and expected values can be interchanged, because $E(Y_{ij}) \le P_{ij} E(X_{ij}) = x_{ij} P_{ij}$, and because the other constraints of P1 have simply been omitted. P2 is a simple linear program that provides an upper bound on what is achievable with any shoot-look-shoot policy. P2 has a direct interpretation where $x_{ij}$ is the probability of using shot $i$ on target $j$, and $y_j$ is the probability that target $j$ is killed. In P2, (a) requires that shot $i$ not be used more than once, (b) requires that the effect of each shot not exceed $P_{ij}$, and (c) requires that target $j$ be killed at most once.

If $b_i$ shots of type $i$ are actually identical, then constraints (a) of P2 can be changed to $\sum_j x_{ij} \le b_i$, a simple consequence of collecting terms with identical coefficients in P2. In that case the interpretation of $x_{ij}$ is “average number of shots of type $i$ used on target $j$.”

Because perfection is a special case of imperfection, the P2 upper bound also applies to problems of the type considered in §1.1. In fact, the aforementioned workbook SLS.xls also computes the upper bound for problems defined on sheet “Optimal,” with the upper bound computations being carried out on sheet “LP_Upper” when the command button on sheet “Optimal” is pressed. Using it, the reader can verify that the upper bound for the small example introduced in §1.1 is actually exact, and that the upper bound is usually sharp for problems not designed to make it look bad. For one of the latter, consider a problem with one target and two shots, each of which will kill the target with probability 0.5. The optimal, indeed the only, allocation of weapons to targets produces a kill probability of 0.75, while the upper bound is 1.
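For the one-target, two-shot example, P2 can be written out in full. The worked instance below assumes a target value of 1, so that the objective is a kill probability, and uses lowercase letters for expected values.

```latex
\begin{align*}
\text{maximize}\quad   & z = y_1 \\
\text{subject to}\quad & x_{11} \le 1, \qquad x_{21} \le 1, \\
                       & y_1 \le 0.5\,x_{11} + 0.5\,x_{21}, \\
                       & y_1 \le 1, \qquad x_{11},\, x_{21},\, y_1 \ge 0 .
\end{align*}
% Setting x_{11} = x_{21} = 1 permits y_1 = 1, so the bound is z = 1,
% whereas firing both shots actually kills with probability
% 1 - (0.5)(0.5) = 0.75.
```

The gap arises because the linear program adds the two shot effects, while in reality the second shot is wasted whenever the first has already killed the target.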

We  next  turn  to  the  construction  of  optimal  firing  policies 
in  specific  circumstances  where  information  is  imperfect. 

Manor and Kress (1997) consider a firing problem in which all shots are identical, with each shot having kill probability $p_j$ against target $j$. If target $j$ is not killed, there is no feedback to the marksman. If the target is killed, that fact is confirmed with probability $q_j$, or otherwise there is no feedback. Information would be perfect if $q_j = 1$, because the state of the target could be inferred from feedback or lack thereof. If $q_j < 1$, however, lack of feedback is possible from live targets, as well as dead ones, so the marksman will not be certain of the state of a target unless he has a confirmed kill. Manor and Kress argue that such a model applies to weapon systems such as fiber-optic guided missiles (FOG-M). The marksman’s object is to kill as many targets as possible, on the average, with a fixed inventory of shots. Manor and Kress demonstrate that the “greedy” policy of always shooting at the target for which the immediate gain is largest is actually optimal. The targets need not all be identical, but, if they are, the policy reduces to shooting at the least-shot-at target among those not known to be dead. This policy results in a lot of switching from one target to another, a regrettable tendency when speed is important. Aviv and Kress (1997) explore other policies that are almost as good but that do not switch targets so frequently.

One might expect the greedy policy to be optimal under more general status reporting. Suppose then that there are $s > 0$ shots and $t > 0$ targets, and let random variable $X_{ik}$ indicate whether target $i$ is alive at time $k$ ($X_{ik} = 1$) or not ($X_{ik} = 0$). Let $\pi_{ik} = P(X_{ik} = 1)$, with $\pi_{i0}$ known, $i = 1, \ldots, t$. Shots are made one at a time. The effect of a shot at target $i$ at time $k - 1$ is to kill it with probability $1 - q$, if it is not already dead, and to produce a report $Y_k$ in some countable set. It is assumed that $Y_k$ depends only on $X_{ik}$ and that $P(Y_k = y \mid X_{ik} = x)$ is known and independent of $i$ and $k$ for all $y$, and for $x = 0$ or 1. The effect of these assumptions is that, if the shot is aimed at target $i$ and $Y_k = y_k$, by Bayes’ theorem,

$$\pi_{ik} = \frac{q\,\pi_{i,k-1}\,P(Y_k = y_k \mid X_{ik} = 1)}{q\,\pi_{i,k-1}\,P(Y_k = y_k \mid X_{ik} = 1) + (1 - q\,\pi_{i,k-1})\,P(Y_k = y_k \mid X_{ik} = 0)}. \tag{4}$$

Because $\pi_{ik}$ is a conditional probability, it is not, and need not be, defined if the probability of the conditioning event (denominator of (4)) is 0. There is no effect on other targets; that is, $\pi_{ik} = \pi_{i,k-1}$ if target $i$ is not shot at.
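The Bayes update for a target that has just been shot at can be sketched as a small function. The report names and error rates in the usage lines are hypothetical, and the function name is not from the paper.

```python
def posterior_alive(prior, q, report, likelihood):
    """Probability that a target is alive after one shot and one report.

    prior      -- probability the target was alive before the shot
    q          -- miss probability of the shot
    report     -- the observed report value
    likelihood -- likelihood[x][y] = P(report = y | state = x), x = 1 alive, 0 dead
    """
    alive = q * prior * likelihood[1].get(report, 0.0)
    dead = (1 - q * prior) * likelihood[0].get(report, 0.0)
    if alive + dead == 0:
        raise ValueError("report has probability zero; the posterior is undefined")
    return alive / (alive + dead)


# Hypothetical two-valued reporting with error rates chosen for illustration.
reports = {1: {"Live": 0.9, "Dead": 0.1},
           0: {"Live": 0.2, "Dead": 0.8}}
after_dead_report = posterior_alive(1.0, 0.5, "Dead", reports)
after_live_report = posterior_alive(1.0, 0.5, "Live", reports)
```

The likelihood table may contain any countable set of report values, so the same function covers measurements richer than a two-valued report.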

It is sometimes assumed that $Y_k$ can have only two values, typically “Live” and “Dead” reports. However, $Y_k$ might also be some physical measurement such as the temperature of the target or the output of a digital filter that measures the extent to which the target optically resembles the thing that was initially shot at. As long as the conditional probabilities in (4) are known for all $(i, k)$, Bayes’ theorem applies.

Let a policy be called myopic if, at every time $k > 0$, it chooses a target $i$ for which $\pi_{ik}$ is largest.

Theorem  1.  Any  myopic  firing  policy  will  maximize  the 
expected  number  of  targets  killed  in  total. 

Proof.  The  proof  of  this  theorem  is  long,  so  we  defer  it  to 
the  appendix. 

If $M$ is the total number of targets killed using a myopic policy, and $N$ is the total number of targets killed under an arbitrary policy, then one might expect the stronger result that $M$ stochastically dominates $N$. However, this is not the case. A counterexample is $t = 2$, $s = 2$, and $q = 0.5$, with no information between shots and with the first target being alive and the second very likely dead. The myopic policy will fire both shots at the first target, thus making it impossible to kill both of them, whereas the policy of splitting the two shots has at least a slight chance of killing both. Therefore, $P(M \le 1) > P(N \le 1)$; that is, $M$ does not stochastically dominate $N$.

It is also not true that only myopic policies can maximize the expected number of targets killed. A counterexample is $t = 2$, $s = 3$, and $q = 0.5$, with no information between shots and both targets alive. The policy of shooting twice at target 1 then once at target 2 is optimal, but not myopic.
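This counterexample can be checked numerically. The sketch below reads the example as two live targets and three shots, as the description of the policy implies, and assumes no feedback between shots; the function name is hypothetical and targets are indexed from 0.

```python
def expected_kills(priors, q, sequence):
    """Expected number of kills when the shots in `sequence` are fired with no feedback.

    priors   -- initial probability that each target is alive
    q        -- miss probability of every shot
    sequence -- target index attacked by each successive shot
    """
    alive = list(priors)
    total = 0.0
    for i in sequence:
        total += alive[i] * (1 - q)   # this shot kills only if the target is still alive
        alive[i] *= q                 # otherwise the target survives with probability q
    return total


# Shooting twice at target 0 and then once at target 1 (not myopic) ...
twice_then_once = expected_kills([1.0, 1.0], 0.5, [0, 0, 1])
# ... versus a myopic order that always attacks the target most likely to be alive.
myopic_order = expected_kills([1.0, 1.0], 0.5, [0, 1, 0])
```

Both orders yield the same expected number of kills because, without feedback, only the number of shots allocated to each target matters.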

There is no reason to expect myopic policies to be optimal in more general circumstances where the targets or shots differ, nor is there any known way of efficiently computing what the optimal policy is in those circumstances. Oddly, the situation can be simpler if the number of shots is random rather than fixed. If the number of shots has a geometric distribution, the firing problem may be indexable (see §4).

2.  The  Unconstrained  Case 

We now assume that shots are available in unlimited quantities, as long as they are paid for at the cost of $c_i$ for each shot of type $i$. We will assume that looks are also expensive. The idea that looks are costly or in short supply was not incorporated in §1 but is actually an important part of some shoot-look-shoot problems.

The great analytical advantage of not having explicit constraints on shots and looks is that the firing processes at the several targets decouple. As a result, problems with many different targets are much easier to solve than in the constrained case. Therefore, we will suppress reference to the target subscript $j$ in this section.

The target value $v$ and the various costs must of course be measured in commensurate units. This may seem to make the model unwieldy in a practical sense, because target values are often just relative valuations of importance, while shot costs are monetary. The prospects for application are not as bad as one might suppose. Section 2.4 includes a discussion of this issue.

2.1.  Perfect  Information,  Infinite  Horizon 

Let $q_i$ be the miss probability for one shot of type $i$, and assume each look reveals correctly whether the target is dead or alive. To avoid trivialities we assume that the look cost $d$ is positive. The marksman must pay the costs of all looks and shots and must also pay the target’s value $v$ if the target is still alive after all shots have finished. The marksman’s objective is to minimize the expected sum of all costs.

In general, the marksman will not know whether the target is dead or alive, so we let $p$ be the probability of the target being alive, a state in the interval $[0, 1]$. To this we add, for convenience, one additional state $R$, in which the marksman has retired and can take no further action. Except in state $R$, the marksman has his choice of looking or shooting at the target in various ways, or of choosing the action $E$ (for “End”), which costs nothing but sends the state to $R$.

Using the code that L stands for looking and that positive integer $i$ stands for a shot of type $i$, a typical firing policy might be 1123L44L11E. It should be understood that firing ends immediately after any look that reveals that the target is dead. Thus, after shooting four times, the marksman looks, shoots two more times if the target is still alive, looks again, shoots two more times if the target is still alive, and then quits. Any policy can be encoded in this manner and then evaluated; for example, the miss probability for the named policy is $q_1^4 q_2 q_3 q_4^2$. It should be obvious, however, that there is no hope of determining the optimal policy by exhaustion. Instead, we employ the theory of partially observable Markov decision processes (Monahan 1982) to first determine the structure of the optimal policy.
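Evaluating an encoded policy can be sketched as a single pass over its symbols. The function name and the numerical costs and miss probabilities below are hypothetical; the sketch assumes single-digit shot types, a target that starts alive, perfect looks, and a policy that ends with E.

```python
def evaluate_policy(policy, shot_cost, miss_prob, look_cost, value):
    """Return (expected total cost, final miss probability) of a policy such as '1123L44L11E'."""
    alive = 1.0    # probability the target is still alive
    active = 1.0   # probability firing has not been stopped by a look showing a dead target
    cost = 0.0
    for symbol in policy:
        if symbol == "L":
            cost += active * look_cost
            active = alive               # firing continues only if the look shows a live target
        elif symbol == "E":
            cost += alive * value        # the target's value is paid if it survived
            break
        else:
            shot = int(symbol)
            cost += active * shot_cost[shot]
            alive *= miss_prob[shot]
    return cost, alive


# Hypothetical data for four shot types.
costs = {1: 8.0, 2: 3.0, 3: 5.0, 4: 4.0}
misses = {1: 0.5, 2: 0.8, 3: 0.9, 4: 0.7}
total_cost, miss = evaluate_policy("1123L44L11E", costs, misses, look_cost=8.0, value=64.0)
```

Because firing can stop only after a look, each shot cost is weighted by the probability that no earlier look has found the target dead.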

A POMDP requires the specification of two functions: $c(s, a)$ and $P(s' \mid s, a)$. The first is the immediate (average) cost of taking action $a$ when the target is in state $s$, given in the present case by

$$c(s, a) = \begin{cases} 0 & \text{if } s = R, \\ c_i & \text{if } s = p \text{ and } a = i, \\ d & \text{if } s = p \text{ and } a = L, \\ pv & \text{if } s = p \text{ and } a = E. \end{cases}$$

The second function gives the probability that $s'$ will be the next state, given that the current state and action are $s$ and $a$. It is given by

$$P(s' \mid s, a) = \begin{cases} 1 & \text{if } s' = R \text{ and either } s = R \text{ or } a = E, \\ 1 & \text{if } s' = pq_i,\ s = p \text{ and } a = i, \\ p & \text{if } s' = 1,\ s = p \text{ and } a = L, \\ 1 - p & \text{if } s' = 0,\ s = p \text{ and } a = L. \end{cases}$$


The  last  two  lines  correspond  to  looking,  with  the  two 
possibilities  after  a  look  being  that  the  next  state  is  1  or  0. 

Let $V(s)$ be the minimal total cost over all stages, given that the initial state is $s$. Clearly, $V(R) = 0$, because subsequent action is impossible from state $R$. Otherwise, for $p$ in the interval $[0, 1]$, the function $V(p)$ is concave and satisfies the functional equation

$$V(p) = \min\left\{ vp;\ d + pV(1);\ \min_i \bigl(c_i + V(pq_i)\bigr) \right\}, \tag{5}$$

where the three expressions correspond to stopping (action E), looking (action L), or shooting with shot $i$ (Strauch 1966, Theorem 9.1). Furthermore, there is an optimal stationary policy that, in state $p$, chooses the action corresponding to the minimal term.

It would be equivalent to let $U(p) = vp - V(p)$, and deal with the functional equation

$$U(p) = \max\left\{ 0;\ -d + pU(1);\ \max_i \bigl(-c_i + pv(1 - q_i) + U(pq_i)\bigr) \right\}. \tag{6}$$

$U(p)$ can be described as the marksman’s gain relative to quitting, with the gain for choosing E being 0 and the immediate gain from shooting being the expected target value killed minus the cost of the shot. The sum of $U(p)$ and $V(p)$ is in all cases $vp$. While the two formulations are equivalent, we prefer (5).


We have the following structural theorem.

Theorem 2. Action E is optimal in an interval $[0, p_1]$ with $p_1 \ge 0$. Action L is optimal over a possibly empty interval $[p_2, p_3]$ with $p_1 \le p_2$.

Proof. Action E is optimal at 0, so it suffices to show that the optimality set is convex. Suppose that E is also optimal at $p_1$, and let $U(p) = vp - V(p)$. The function $U(p)$ is convex because $V(p)$ is concave, and $U(0) = U(p_1) = 0$. Therefore, $U(p) \le 0$ on the interval $[0, p_1]$. But $U(p) \ge 0$ for all $p$ by the definition of $V(p)$, so actually $U(p) = 0$ for $0 \le p \le p_1$; that is, action E is optimal over some convex set. The proof for action L (or any action whose penalty is a linear function of $p$) is similar. The only state $p^*$ where E and L have the same penalty is the one where $vp^* = d + p^* V(1)$. The optimality sets for E and L therefore cannot overlap except perhaps at a single point, so $p_2$ cannot be smaller than $p_1$. □


A stationary policy will always choose the same action when in state 1, which happens after every look unless the target is revealed to be dead. Therefore, there is an optimal stationary policy in state 1 that repeats itself after every look unless the target is revealed to be dead, unlike the nonstationary example given earlier. If such a policy begins 1123L, then it must continue 1123L1123L1123L... indefinitely until the target is finally revealed to be dead. Let $x$ be a shot sequence of the form $(i_1, i_2, \ldots, i_k)$, where $i_j$ is a shot type for $1 \le j \le k$. Because the order of shots is immaterial in $x$, we will list them in nondecreasing order. Adopt the notation $x$L for stationary policies that shoot $x$, look, and repeat the cycle until the target is killed. Another possibility is that the policy shoots $x$ and then quits, for which we adopt the notation $x$E. The final possibility is that the policy simply chooses E at the first opportunity, a special case of $x$E where $k = 0$. There are no other possibilities, so we have Theorem 3.


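Theorem 3. Let $c_x = \sum_{j=1}^{k} c_{i_j}$ and $q_x = \prod_{j=1}^{k} q_{i_j}$, with $c_x = 0$ and $q_x = 1$ if $k = 0$. Then

$$V(1) = \min\left\{ \min_x \frac{d + c_x}{1 - q_x};\ \min_x \bigl(v q_x + c_x\bigr) \right\}. \tag{7}$$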


Proof. Stationary policies of the $x$L type lead to the first term. The cost $d + c_x$ is paid a geometric number of times, $1/(1 - q_x)$ on the average, before the target is finally killed. Stationary policies of type $x$E lead to the second term, with $k = 0$ corresponding to ending immediately. No other policies need be considered because there is guaranteed to be an optimal policy that is either $x$L or $x$E. □
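The minimization in Theorem 3 can be sketched by enumerating shot multisets. The cap on the number of shots of each type is an assumption that makes the search finite, the function name is hypothetical, and the data in the usage line are the values $v = 64$, $d = 8$, $c_1 = 8$, $c_2 = 3$, $q_1 = 0.5$, $q_2 = 0.8$ of the example in this section.

```python
from itertools import product
from math import prod


def best_stationary_policy(shot_cost, miss_prob, look_cost, value, max_per_type=6):
    """Minimize the two terms of the Theorem 3 formula over bounded shot multisets.

    Returns (cost, policy), where policy is an encoded string ending in L or E.
    Assumes every miss probability is strictly less than 1.
    """
    types = sorted(shot_cost)
    best = (value, "E")                               # null sequence: quit immediately
    for counts in product(range(max_per_type + 1), repeat=len(types)):
        if not any(counts):
            continue
        c_x = sum(n * shot_cost[i] for i, n in zip(types, counts))
        q_x = prod(miss_prob[i] ** n for i, n in zip(types, counts))
        x = "".join(str(i) * n for i, n in zip(types, counts))
        best = min(best,
                   ((look_cost + c_x) / (1 - q_x), x + "L"),   # shoot x, look, repeat
                   (value * q_x + c_x, x + "E"))               # shoot x, then quit
    return best


cost, policy = best_stationary_policy({1: 8.0, 2: 3.0}, {1: 0.5, 2: 0.8}, 8.0, 64.0)
```

The enumeration grows exponentially with the number of shot types, which is why an easily computed bound is of interest.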


A similar expression could be proved for $V(p)$ in general, but Theorem 3 will suffice because the target is normally assumed to start in the live state.

Even though Theorem 3 provides a comparatively simple way to determine $V(1)$, the amount of computation might still be significant if there were many shot types. The following corollary concerns a lower bound that is easier to compute.
Corollary 1. Let $i^*$ be the shot type that minimizes $c_i/(-\ln(q_i))$, let $a = -\ln(q_{i^*})$, and let $c = c_{i^*}$. Then

$$V(1) \ge \min\left\{ \min_{u > 0} \frac{d + cu}{1 - \exp(-au)};\ \min_{u \ge 0} \bigl(v \exp(-au) + cu\bigr) \right\}.$$

Proof. Let $a_i = -\ln q_i$, and let $m_i$ be the number of shots
of type $i$ in the shot sequence $x$. Then $q_x = \exp(-\sum_i a_i m_i)$
and $c_x = \sum_i c_i m_i$, where both sums extend over all shot
types. By relaxing the minimization (7) to permit noninteger
values $u_i$ in place of $m_i$, we obtain the bound

$$V(1) \ge \min\left\{ \min_{u} \frac{d + \sum_i c_i u_i}{1 - \exp\bigl(-\sum_i a_i u_i\bigr)};\ \min_{u} \Bigl( v \exp\bigl(-\sum_i a_i u_i\bigr) + \sum_i c_i u_i \Bigr) \right\},$$

where both inner minimizations are with respect to the
vector $(u_i)$. If $u_i > 0$ for some $i \ne i^*$, then increase $u_{i^*}$ by
$c_i u_i / c_{i^*}$ and set $u_i$ to 0. The resulting change will leave
$c_x$ unchanged while not increasing $q_x$, so it suffices to
consider only the single weapon type $i^*$. □
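
As a rough numerical illustration of Corollary 1, the sketch below evaluates the two one-dimensional minimizations on a uniform grid of u values. The function name `corollary1_bound`, the grid search, and the settings `u_max` and `steps` are assumptions made for this sketch (any scalar minimizer would serve); the inputs are the values used in the example of this section.

```python
import math

def corollary1_bound(v, d, costs, miss_probs, u_max=20.0, steps=20_000):
    """Grid-search sketch of the Corollary 1 lower bound on V(1).

    costs[i] and miss_probs[i] are c_i and q_i; u_max and steps are
    illustrative search settings, not part of the corollary.
    """
    # i* minimises c_i / (-ln q_i); a and c belong to that shot type.
    i_star = min(range(len(costs)),
                 key=lambda i: costs[i] / -math.log(miss_probs[i]))
    a, c = -math.log(miss_probs[i_star]), costs[i_star]

    grid = [u_max * k / steps for k in range(1, steps + 1)]  # u > 0
    # First term: shoot u times, look, repeat until the target is killed.
    look_value, look_u = min(((d + c * u) / (1.0 - math.exp(-a * u)), u)
                             for u in grid)
    # Second term: shoot u times, then end (u = 0 means end immediately).
    end_value = min(v * math.exp(-a * u) + c * u for u in [0.0] + grid)
    return i_star, look_u, look_value, end_value


i_star, look_u, look_value, end_value = corollary1_bound(
    v=64, d=8, costs=[8, 3], miss_probs=[0.5, 0.8])
bound = min(look_value, end_value)
print(i_star + 1, round(look_u, 3), round(bound, 3))
```

With these inputs the first (look) term is the smaller of the two, so the bound is attained by a fractional number of shots followed by a look.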

Corollary 2. Let $r = \min_i c_i/(1 - q_i)$ and suppose $d > 0$.
If $r > v$, then the optimal action in state 1 is E. If $r < v$,
then the optimal action in state 1 is to shoot.

Proof. We first note by inspection of (5) that action L cannot
be optimal in state 1, regardless of $r$. Let $p_i = 1 - q_i$.
More generally, let $p_x = \sum_{j=1}^{k} p_{x_j}$ for any shot sequence $x$,
and note that $p_x + q_x \ge 1$ for all $x$, as can be easily proved
by induction on the length of $x$. By the definition of $r$, $p_i r \le c_i$
for all shot types $i$, and therefore $p_x r \le c_x$ for all shot
sequences $x$. It follows that $(1 - q_x) r \le c_x$ for all $x$. Suppose
xE is an optimal stationary policy for some nonnull shot
sequence $x$. Then $v \ge v q_x + c_x$, hence $v \ge c_x/(1 - q_x)$, and
hence $v \ge r$. If actually $v < r$, the contradiction shows that
$x$ can only be null and establishes that the optimal action
in state 1 is E. If $v > r$, then there is some shot type $i$ for
which $v q_i + c_i < v$, and therefore E cannot be optimal. □

Example. Suppose $v = 64$, $d = 8$, $c_1 = 8$, $c_2 = 3$, $q_1 = 0.5$,
$q_2 = 0.8$. It can be shown by exhaustion that the optimal
policy when $p = 1$ is 12L. The target is always killed,
and the average cost of doing so is $V(1) = (8 + 3 + 8)/(1 - (0.5)(0.8)) = 31.67$.
The minimal cost with a pure shot type is 32, achieved with
any of the policies 1L, 11L, 11E, or 111E. This shows that it
is not in general possible to confine attention to “pure”
shooting policies. The lower bound of Corollary 1 is 30.912,
obtained by shooting 1.421 times with shot $i^* = 1$, looking,
and repeating the action until the target is killed. According
to Corollary 2, the optimal action in state 1 will be to shoot
(either with shot 1 or shot 2) as long as $v > 15$.
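
The "exhaustion" mentioned in the example can be imitated by enumerating shot sequences in (7) directly. The sketch below is only an illustration: the function names and the truncation `max_len` are assumptions (the minimization in (7) ranges over all sequences, and the cutoff is simply taken large enough for this data), and policy labels such as "12L" are built by the code rather than taken from the paper.

```python
from itertools import product

def theorem3_value(v, d, costs, miss_probs, max_len=8):
    """Evaluate (7) by enumerating shot sequences x of length <= max_len.

    Returns (value, policy) where policy is a label such as "12L".
    max_len is an assumed truncation; (7) itself ranges over all sequences.
    """
    best = (float(v), "E")  # null sequence followed by E: accept the loss v
    for k in range(1, max_len + 1):
        for x in product(range(len(costs)), repeat=k):
            c_x = sum(costs[i] for i in x)
            q_x = 1.0
            for i in x:
                q_x *= miss_probs[i]
            label = "".join(str(i + 1) for i in x)
            best = min(best,
                       ((d + c_x) / (1.0 - q_x), label + "L"),  # xL policies
                       (v * q_x + c_x, label + "E"))            # xE policies
    return best

def corollary2_threshold(costs, miss_probs):
    """r = min_i c_i / (1 - q_i) from Corollary 2."""
    return min(c / (1.0 - q) for c, q in zip(costs, miss_probs))

v, d, costs, miss_probs = 64, 8, [8, 3], [0.5, 0.8]
value, policy = theorem3_value(v, d, costs, miss_probs)
# Best value when only one shot type may be used ("pure" policies).
pure = min(theorem3_value(v, d, [c], [q])[0] for c, q in zip(costs, miss_probs))
r = corollary2_threshold(costs, miss_probs)
print(round(value, 2), policy, round(pure, 2), round(r, 2))
```

The order of shots within a cycle does not affect the cost in (7), so 12L and 21L tie; the code reports whichever label sorts first.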


Figure 1. Penalties for six policies (E, 2E, L12L,
2L12L, 11E, 12L) in the example problem of
§2.1 are shown by six lines, listed in increasing
order of the intersection with the y-axis.


Note. The function $V(p)$ is the lower envelope (concave hull).


Figure 1 shows the penalties for each of six stationary
policies as linear functions of the state $p$. All other policies
are dominated by one of these six. The first character of
each policy shows the (first) optimal action in that state.
Note that there is only one policy that begins with E and
one that begins with L, consistent with Theorem 2.

Solving the equation $64p = 8 + pV(1)$ for $p$, we find
that the E and L actions are tied in (5) when $p = 0.247$, at
which point both penalties are 15.84. However, the optimal
policy in that state is 2E, which produces a smaller penalty
of $3 + 0.8(0.247)64 = 15.65$. In fact, action 2 is the optimal
action in the interval [0.234, 0.256], which separates
the interval where E is the optimal action from the interval
where L is the optimal action. Thus, we have a counterintuitive
situation where increasing the probability of being
alive can cause the optimal action to change from shooting
to looking. In other words, the set of states where shooting
is optimal is not convex. This phenomenon is also noted by
Ross (1971) in a different Markov decision process. States
in the interval [0.234, 0.256], where 2E is the optimal
policy, will never arise if the initial state is 1 and
the optimal policy 12L is followed.
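
The three crossing points quoted above can be checked with a few lines of arithmetic. The sketch below compares only the three penalty lines that matter near p = 0.25 (ending, shot 2 then ending, and looking then following 12L); the helper names are assumptions of this sketch, and V(1) is taken from the example.

```python
def crossing(line_a, line_b):
    """State p at which two penalty lines (intercept, slope) are equal."""
    (a0, a1), (b0, b1) = line_a, line_b
    return (b0 - a0) / (a1 - b1)

v, d, c2, q2 = 64.0, 8.0, 3.0, 0.8
V1 = (8 + 3 + 8) / (1 - 0.5 * 0.8)      # cost of policy 12L from state 1

lines = {
    "E": (0.0, v),            # end now: v*p
    "2E": (c2, q2 * v),       # shot 2, then end: c2 + q2*v*p
    "L12L": (d, V1),          # look, then 12L if alive: d + p*V(1)
}

def best_of_three(p):
    """Cheapest of the three policies at state p."""
    return min(lines, key=lambda name: lines[name][0] + lines[name][1] * p)

p_low = crossing(lines["E"], lines["2E"])       # E and 2E tie
p_tie = crossing(lines["E"], lines["L12L"])     # E and L tie
p_high = crossing(lines["2E"], lines["L12L"])   # 2E and L tie
print(round(p_low, 3), round(p_tie, 3), round(p_high, 3), best_of_three(p_tie))
```

At the state where E and L tie, the shot-then-end policy is cheaper than both, which is the non-convexity described in the text.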

2.2.  Imperfect  Information,  Infinite  Horizon 

The looks and shots of §2.1 can each be regarded as special
cases of a generalized shot. Generalized shot $i$ (hereafter,
simply shot $i$) has the effect of first killing the target with
probability $(1 - q_i)$, and then providing a report about the
new state of the target. The effect of such a shot is to
first change the state from $p$ to $pq_i$, and then to provide a
report about the transformed target state. Let $b_i(pq_i, k)$ be
the posterior probability that the target is alive, given that
the prior probability (before the shot) is $p$ and that report
$k$ is generated. We assume this function to be known, typically
through an application of Bayes’ theorem. If shot $i$
is selected as the next action, the resulting report will be
a random variable $K$ with a known distribution, and the
next state will be a random variable $b_i(pq_i, K)$. Shots with
no information output can be modeled by providing only
one possibility for $K$ (in which case $b_i(pq_i, K) = pq_i$),
and shots that provide only information (“sensors”) can be
modeled by setting $q_i = 1$.

Let $V(p)$ be the marksman’s minimal total loss, including
all shooting costs and the cost of the target’s survival.
The generalized version of (5) is then

$$V(p) = \min\left\{ vp;\ \min_{i} \bigl( c_i + E(V(b_i(pq_i, K))) \bigr) \right\}. \qquad (8)$$

The expected value in (8) is with respect to the distribution
of $K$, the sensor report.

The  same  existence  results  apply  to  (8)  as  to  (5).  How¬ 
ever,  there  is  no  counterpart  to  (7)  in  this  case  because  of 
the  complicated  structure  associated  with  looks.  Solution 
of  (8)  must  be  by  an  approximation  method  such  as  policy 
or  value  iteration  (Thomas  et  al.  1983,  Puterman  1994). 
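
A minimal sketch of value iteration for (8) follows, under assumptions that are not in the text: the report K is taken to be two-valued ("alive" or "dead") with a hit probability and a false-alarm probability per shot, the state p is discretized on a uniform grid, and V is linearly interpolated between grid points. The function name and all numerical settings are illustrative, and the noisy sensor in the usage lines is hypothetical.

```python
import numpy as np

def solve_generalized(v, shots, n_grid=1001, tol=1e-9, max_iter=5000):
    """Grid value iteration for (8) with a two-valued report K.

    shots: list of (c, q, hit, false_alarm), where hit = P(K="alive" | alive)
    and false_alarm = P(K="alive" | dead) after the shot. This report model is
    an assumption of the sketch. Returns the grid p and approximate V(p).
    """
    p = np.linspace(0.0, 1.0, n_grid)
    V = v * p                                   # value of ending immediately
    for _ in range(max_iter):
        candidates = [v * p]
        for c, q, hit, false_alarm in shots:
            alive = p * q                       # state after the kill attempt
            pr_a = hit * alive + false_alarm * (1.0 - alive)   # P(K="alive")
            pr_d = 1.0 - pr_a                                  # P(K="dead")
            # Bayes posteriors b_i(p q_i, k); zero-probability reports get weight 0.
            post_a = np.divide(hit * alive, pr_a,
                               out=np.zeros_like(p), where=pr_a > 0)
            post_d = np.divide((1.0 - hit) * alive, pr_d,
                               out=np.zeros_like(p), where=pr_d > 0)
            candidates.append(c + pr_a * np.interp(post_a, p, V)
                                + pr_d * np.interp(post_d, p, V))
        V_new = np.minimum.reduce(candidates)
        if np.max(np.abs(V_new - V)) < tol:
            return p, V_new
        V = V_new
    return p, V

blind = [(8, 0.5, 1.0, 1.0), (3, 0.8, 1.0, 1.0)]   # shots 1 and 2, no information
perfect_look = (8, 1.0, 1.0, 0.0)                   # the look of Section 2.1
noisy_look = (8, 1.0, 0.9, 0.1)                     # hypothetical error-prone sensor
_, V_perfect = solve_generalized(64, blind + [perfect_look])
_, V_noisy = solve_generalized(64, blind + [noisy_look])
print(V_perfect[-1], V_noisy[-1])
```

With the perfect look the computed V(1) should agree, up to grid error, with the value 31.67 found in §2.1; degrading the sensor cannot lower the loss.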

2.3.  Finite  Horizon 

Stationary  policies  can  no  longer  be  expected  when  the 
time horizon is finite. The best action will depend on time,
as  well  as  on  the  probability  that  the  target  is  still  alive,  with 
the  marksman  sometimes  resorting  to  increasingly  expen¬ 
sive  shots  when  time  is  almost  used  up.  Obtaining  POMDP 
solutions  is  more  difficult  and  essentially  numerical,  but  at 
least  the  problem  of  firing  at  many  targets  still  decouples; 
there  is  no  need  to  consider  a  joint  firing  problem  as  long 
as  shots  have  costs,  rather  than  constraints.  See  Monahan 
(1982)  or  Lovejoy  (1991)  for  surveys  of  solution  methods. 

Models  of  this  sort  are  particularly  attractive  if  there  are 
many  targets  of  the  same  type,  since  the  policy  found  in 
solving  one  POMDP  applies  to  every  target.  Exactly  such 
a  model  lies  at  the  heart  of  Yost  (1998),  who  considers 
problems  where  the  number  of  target  types  is  small,  even 
though  the  number  of  targets  is  large. 

2.4.  Connection  to  the  Constrained  Case 

In  the  example  of  §2.1,  the  target  is  always  eventually 
killed  after  expending  1.667  looks,  1.667  shots  of  type  1, 
and  1.667  shots  of  type  2,  on  the  average.  If  there  were 
n  =  600  targets,  the  expenditures  to  kill  all  of  them  would 
scale  up  to  1,000  for  each  resource.  Now,  consider  the  con¬ 
strained  problem  of  killing  as  many  of  600  identical  tar¬ 
gets  as  possible,  subject  to  constraints  of  1,000  on  each 
of  the  three  resources,  a  problem  that  is  much  too  hard 
to  solve  using  the  techniques  of  §1.  We  have  seemingly 
almost  stumbled  onto  a  solution,  because  the  application 
of the optimal policy to each of the targets independently
will consume exactly the right resources in total.
Everett (1963) shows that there can be no better solution to
the  constrained  problem,  but  there  are  still  two  obstacles  to 
application. 

One  obstacle  is  that  the  constraints  have  to  be  interpreted 
as  constraints  holding  only  on  the  average,  because  the 
consumption  of  the  three  resources  in  the  joint  POMDP 
is  1,000  on  the  average.  On  the  other  hand,  constraints  in 
the  actual  firing  problem  are  more  likely  to  be  required  to 
hold  with  certainty.  This  problem  is  least  objectionable  as 
the  number  n  of  targets  grows  large.  The  standard  devia¬ 
tion  of  the  total  allocation  in  each  case  is  proportional  to 
the  square  root  of  n,  while  the  mean  is  proportional  to  n, 
so  the  ratio  of  the  two  (the  coefficient  of  variation)  will 
approach zero as n approaches infinity. In this sense, the
unconstrained  case  can  be  thought  of  as  a  solution  method 
for  constrained  problems  where  n  is  large  because  resource 
consumption  fluctuations  become  of  less  and  less  concern 
as  n  increases. 
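
To put numbers on this argument for the example of §2.1: under policy 12L each target requires a geometric number of shoot-shoot-look cycles with success probability 1 - (0.5)(0.8) = 0.6, and each cycle uses one unit of each resource. The sketch below computes the mean, standard deviation, and coefficient of variation of the total; the function name is an assumption, and independence across targets is as in the text.

```python
import math

def resource_use(n_targets, cycle_kill_prob):
    """Mean, std, and coefficient of variation of total cycles over n targets.

    Each target needs a geometric number of cycles (support 1, 2, ...) with
    success probability cycle_kill_prob; targets are independent.
    """
    mean_one = 1.0 / cycle_kill_prob
    var_one = (1.0 - cycle_kill_prob) / cycle_kill_prob ** 2
    mean = n_targets * mean_one
    std = math.sqrt(n_targets * var_one)
    return mean, std, std / mean

kill_prob = 1 - 0.5 * 0.8
mean, std, cv = resource_use(600, kill_prob)
_, _, cv_4x = resource_use(2400, kill_prob)
print(round(mean, 1), round(std, 2), round(cv, 4), round(cv_4x, 4))
```

Quadrupling the number of targets halves the coefficient of variation, which is the square-root behavior described above.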

The  other  obstacle  is  that  the  constraints  are  unlikely 
to  be  met  exactly,  even  on  the  average,  because  the  coin¬ 
cidence  that  the  shot  and  look  costs  happen  to  be  cho¬ 
sen  so  that  the  constraints  are  all  exactly  satisfied  is  too 
much  to  hope  for.  There  needs  to  be  some  mechanism  for 
adjusting  the  shot  costs  to  make  that  happen.  Yost  and 
Washburn  (2000)  propose  to  do  this  in  an  iterative  scheme 
where  a  linear  program  produces  dual  variables  that  play 
the  role  of  costs,  while  the  POMDP  uses  the  costs  to  pro¬ 
duce  new  policies  that  end  up  being  columns  in  the  linear 
program.  This  mechanism  for  deriving  resource  costs  from 
constraints  makes  it  unnecessary  to  make  a  priori  judg¬ 
ments  about  the  comparative  values  of  targets  and  shots. 

The Yost-Washburn method is a column-generation technique
for finding the correct costs and associated optimal
policies. It is characteristic of such schemes that feasible
solutions show rapid improvements at first, but that
convergence in the tails is very slow. Because upper and lower
bounds  are  available  at  all  times  when  using  that  method, 
a  practical  implementation  will  incorporate  a  termination 
rule  that  accepts  nearly  optimal  solutions. 

3. The Semi-Constrained Case with Perfect Information

Assume that each shot $i$ in some set $S$ has a cost $c_i$ and
a kill probability $p_i = 1 - q_i$, and the object is to kill the
single target as cheaply as possible. The target is presumed
to be so valuable that shooting will not stop until either
shots are exhausted or the target is killed. The probability
of killing the target is therefore $1 - \prod_{i \in S} q_i$, the usual
“powering up” formula that applies to independent shots.
Because  shooting  will  stop  if  the  target  is  killed,  it  makes 
intuitive  sense  to  first  use  shots  with  high  kill  probabili¬ 
ties  and  low  costs.  The  following  theorem  makes  this  idea 
precise. 

Theorem 4. Rank the shots according to increasing values
of the cost/effectiveness ratio $c_i/p_i$. The minimal average
cost of firing is achieved by making the shots in that order.

Proof. Let $Q_i = \prod_{j<i} q_j$. If the shots are made in index
order, $Q_i$ is the probability that all shots before the $i$th
shot fail to kill the target, and therefore also the probability
that the $i$th shot will be required. The expected
cost of firing is therefore $C = \sum_{i \in S} c_i Q_i$. Let $C_k$ be the
expected cost of firing if the $k$th shot is interchanged with
the $(k+1)$st in the firing order, and let $\delta = C - C_k$. All but
two terms from each sum cancel in the subtraction, so

$$\delta = Q_k\bigl((c_k + c_{k+1} q_k) - (c_{k+1} + c_k q_{k+1})\bigr) = Q_k\,(c_k p_{k+1} - c_{k+1} p_k).$$

The difference $\delta$ will be positive if and only if $c_k/p_k > c_{k+1}/p_{k+1}$.
Because $k$ is arbitrary, this shows that the shots
should be ordered as stated. □
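
Theorem 4 is easy to check by brute force on a small instance. In the sketch below the four shots and their costs and kill probabilities are made-up illustrative numbers, and the function names are assumptions; the code compares the ratio ordering against every permutation.

```python
from itertools import permutations

def expected_cost(order, costs, kill_probs):
    """Expected firing cost when shooting stops at the first kill."""
    total, still_alive = 0.0, 1.0
    for i in order:
        total += still_alive * costs[i]      # shot i is needed only if all earlier shots missed
        still_alive *= 1.0 - kill_probs[i]
    return total

def ratio_order(costs, kill_probs):
    """Shots sorted by increasing cost/effectiveness ratio c_i / p_i."""
    return sorted(range(len(costs)), key=lambda i: costs[i] / kill_probs[i])

# Hypothetical data: ratios are 10, 20, 15, 13.3, so the order is 0, 3, 2, 1.
costs = [5.0, 2.0, 9.0, 4.0]
kill_probs = [0.5, 0.1, 0.6, 0.3]

order = ratio_order(costs, kill_probs)
best_by_search = min(expected_cost(perm, costs, kill_probs)
                     for perm in permutations(range(len(costs))))
print(order, expected_cost(order, costs, kill_probs), best_by_search)
```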

4.  Applicability  of  Bandit  Processes 

A “bandit” process is one where the decision maker must at
each time choose one of a fixed number of Markovian bandits,
thus changing the state of the selected bandit, receiving
a reward, and receiving some information about the
changed state of the selected bandit. Bandits other than the
one chosen are unaffected. Rewards are time-discounted by
the factor $0 \le \beta < 1$, and the decision maker’s object is to
maximize the expected value of the total discounted reward.
The  original  application  was  to  slot  machines  (one-armed 
bandits)  with  unknown  payout  statistics,  hence  the  name.  In 
the  present  context,  the  decision  maker  is  a  marksman  who 
must  at  each  time  choose  a  bandit  target  to  shoot  at.  The 
attractive  feature  of  bandit  processes  is  that  they  are  index¬ 
able;  that  is,  there  exists  a  Gittins  index,  computable  for 
each  bandit  separately  from  the  others,  such  that  choosing 
the  bandit  with  the  largest  index  at  all  times  is  an  opti¬ 
mal  policy.  The  Gittins  index  depends  on  the  state  of  the 
bandit  and  is  generally  difficult  to  compute.  Even  so,  the 
separability  of  each  bandit  from  the  others  is  an  attractive 
feature. 

An alternative interpretation is that rewards are not discounted
but the bandit process may be terminated at any
time, after which there are no further rewards. With this
interpretation, $\beta$ is the probability that the bandit process
will not terminate after each decision, in which case the
number of decision opportunities is a gradually revealed
geometric random variable with mean $1/(1 - \beta)$. In the present
shooting context, $1/(1 - \beta)$ is the average number of shots
available to the marksman.

Bandit  processes  have  occasionally  been  applied  to 
shooting  problems.  For  example,  Glazebrook  et  al.  (2001) 
consider  a  two-sided  problem  where  each  bandit  is  an 
attacker  of  unknown  type,  and  where  the  defending  marks¬ 
man’s  problem  is  to  decide  which  bandit  to  shoot  at  next, 
given  that  the  bandit  will  return  fire  if  not  killed.  Barkdoll 
et  al.  (2002)  consider  a  related  problem  in  suppression  of 
enemy  air  defenses  (SEAD)  where  the  decision  maker  is 
the  enemy  air  defense  supposedly  being  suppressed.  The 
marksman  must  not  only  decide  which  bandit  (i.e.,  attacker) 
to  shoot  at  next,  but  also  how  his  engagement  radar  should 
operate in support of the shot, whether in continuous or
intermittent mode. Here, the decisions concern not only the
choice  of  bandit,  but  also  how  the  chosen  bandit  should  be 
dealt  with.  This  additional  feature  formally  takes  the  deci¬ 
sion  process  out  of  the  class  for  which  index  strategies  are 
known  to  be  optimal.  Even  so,  Barkdoll  et  al.  (2002)  pro¬ 
pose  an  index-based  heuristic  for  their  shooting  problem 
that performs well in numerical experiments. See Glazebrook
and Fay (1990) for a theoretical discussion of such
developments in bandit problems.

In  the  interest  of  formulating  a  problem  where  the  Gittins 
index  can  be  derived  explicitly,  we  consider  next  a  simple, 
one-sided  problem  similar  to  those  considered  in  previous 
sections. 

The decision maker cannot have any choices other than
bandit selection, so our marksman cannot have his choice
of several shot types once he selects a target. Therefore, we
introduce the following modified version of Equations (6)
and (8), the modifications being to delete the shot subscript
$i$, and replace the option of retiring with 0 payoff by the
option of retiring with a payoff of $M$:

$$U(p, M) = \max\bigl( M;\ -c + pv(1 - q) + \beta E(U(b(pq, K), M)) \bigr). \qquad (9)$$

The Gittins index $M(p)$ is the smallest value of $M$ for
which the options of retiring and continuing are equivalent
(Whittle 1982). In general, $M(p)$ must be computed by an
iterative numerical procedure, but the case where information
is perfect is an exception. In that case (9) reduces to

$$U(p, M) = \max\bigl( M;\ -c + pv(1 - q) + \beta\,[\,pq\,U(1, M) + (1 - pq)\,U(0, M)\,] \bigr). \qquad (10)$$

We will first find $U(0, M)$, then $U(1, M)$, and then deal
with the general case. Because $U(0, M) = \max(M, -c + \beta U(0, M))$,
it follows that $U(0, M) = \max(M, -c/(1 - \beta))$,
and therefore that $M(0) = -c/(1 - \beta)$.

For $M \ge M(0)$, $U(0, M) = M$, and therefore,

$$U(1, M) = \max\bigl( M;\ -c + v(1 - q) + \beta\{ q\,U(1, M) + (1 - q) M \} \bigr). \qquad (11)$$


Equating $U(1, M)$ to the continuation expression in (10)
and solving for $U(1, M)$, we see that (11) is equivalent to

$$U(1, M) = \max\left( M;\ \frac{-c + v(1 - q) + \beta(1 - q)M}{1 - \beta q} \right). \qquad (12)$$


For values of $M$ between $M(0)$ and $M(1)$, $U(0, M) = M$
and $U(1, M)$ is the continuation expression in (12), so for
the general case $0 \le p \le 1$ we have Equation (13):

$$U(p, M) = \max\left\{ M;\ -c + pv(1 - q) + \beta\left[ pq\,\frac{-c + v(1 - q) + \beta(1 - q)M}{1 - \beta q} + (1 - pq) M \right] \right\}. \qquad (13)$$

The smallest (in fact only) value of $M$ for which the two
expressions in (13) are equal is the Gittins index,

$$M(p) = \frac{pv(1 - q)}{(1 - \beta)\bigl(1 - \beta q(1 - p)\bigr)} - \frac{c}{1 - \beta}, \qquad 0 \le p \le 1. \qquad (14)$$


If  there  are  actually  several  targets,  then  we  have  only  to 
add  subscripts  to  p,  c,  v,  and  q  to  have  a  shooting  index — a 
number  that  determines  the  next  shot  in  all  circumstances. 
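
The step from (13) to the index (14) can be filled in as follows. This is a reconstruction from the surrounding equations rather than part of the original text, and the shorthand $g$ is introduced only for this sketch.

```latex
% Shorthand (assumption of this sketch): g = v(1-q).
% Set M equal to the continuation expression in (13):
M = -c + p g + \beta\left[ pq\,\frac{-c + g + \beta(1-q)M}{1-\beta q} + (1-pq)M \right].
% Multiply by (1 - \beta q) and collect the terms in M:
(1-\beta)\,\bigl(1 - \beta q(1-p)\bigr)\,M = p g - c\,\bigl(1 - \beta q(1-p)\bigr).
% Divide through:
M(p) = \frac{p\,v(1-q)}{(1-\beta)\bigl(1 - \beta q(1-p)\bigr)} - \frac{c}{1-\beta}.
```

Setting $p = 0$ recovers $M(0) = -c/(1 - \beta)$, and setting $\beta = 0$ gives the immediate gain $pv(1 - q) - c$.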

The index policy is myopic when $\beta = 0$ because the
target with the largest immediate gain $pv(1 - q) - c$ is
always chosen. However, in general $M(p)$ is an increasing,
concave nonlinear function of $p$, and the best policy is not
necessarily myopic.

Example. Suppose $\beta = 0.9$, and that there are two targets
with $q = 0.5$ and $c = 0$ for each. If $v_1 = 1$ and $p_1 = 1$,
the index for target 1 is 5. If $v_2 = 4$ and $p_2 = 0.2$, the
index for target 2 is 6.25. The myopic rule would prefer to
engage target 1 first (immediate gains of 0.5 versus 0.4),
but the optimal rule prefers target 2.
Valuable targets with a low probability of being alive have
a surprisingly large index on account of the concavity of
$M(p)$.
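
The two indices in this example can be reproduced with a short calculation. The sketch below uses the closed form obtained by equating the two expressions in (13), namely M(p) = pv(1 - q)/((1 - β)(1 - βq(1 - p))) - c/(1 - β), and then checks that retiring and continuing are indeed equivalent at that value. The function names are assumptions, and the value v = 4 for target 2 is the one implied by the stated index of 6.25.

```python
def gittins_index(p, v, q, c, beta):
    """Closed-form index for the perfect-information shooting bandit."""
    return (p * v * (1 - q) / ((1 - beta) * (1 - beta * q * (1 - p)))
            - c / (1 - beta))

def continuation_value(p, M, v, q, c, beta):
    """Second expression inside the max of (13): shoot once, then act optimally."""
    u1 = (-c + v * (1 - q) + beta * (1 - q) * M) / (1 - beta * q)  # U(1, M) from (12)
    return -c + p * v * (1 - q) + beta * (p * q * u1 + (1 - p * q) * M)

beta, q, c = 0.9, 0.5, 0.0
targets = {"target 1": (1.0, 1.0), "target 2": (0.2, 4.0)}   # (p, v)

for name, (p, v) in targets.items():
    M = gittins_index(p, v, q, c, beta)
    gain = p * v * (1 - q) - c                     # myopic criterion
    gap = continuation_value(p, M, v, q, c, beta) - M
    print(name, round(M, 4), round(gain, 4), round(gap, 12))
```

The myopic gain ranks target 1 first while the index ranks target 2 first, matching the example.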


5.  Summary 

We  have  reviewed  the  shoot-look-shoot  state  of  the  art  and 
introduced  several  new  results. 

As  often  happens,  the  difficult  constrained  problems  are 
in  the  middle.  The  problems  of  §1  are  tractable  when  the 
numbers  of  shots  and  targets  are  small.  When  all  numbers 
are  large,  the  methods  of  subsequent  sections  may  produce 
approximate  solutions  by  decoupling  the  targets.  Problems 
with  moderate  numbers  of  targets  and  shots  remain  diffi¬ 
cult,  particularly  if  information  is  not  perfect. 


Appendix.  Proof  of  Theorem  1 

Proof (refer to §1.3 for notation and exact statement of
Theorem 1). For $y$ an observation and $\pi$ a probability, we
first introduce the function

$$T(\pi, y) = \frac{q\pi P(Y = y \mid X = 1)}{q\pi P(Y = y \mid X = 1) + (1 - q\pi) P(Y = y \mid X = 0)}$$

(recall that the conditional probabilities are assumed
known). If $(y_1, \ldots, y_k)$ is a sequence of observations, we
generalize by defining

$$T(\pi, (y_1, \ldots, y_k)) = T(T(\pi, (y_1, \ldots, y_{k-1})), y_k).$$

With this notation, Equation (4) is $\pi_{ik} = T(\pi_{i,k-1}, y_k)$.
Equivalently, $\pi_{ik} = T(\pi_{i0}, (y_1, \ldots, y_k))$. We first observe
that $T(\pi, y)$ is a nondecreasing function of $\pi$ for any
observation or sequence of observations, as is easily proved
by induction.

Let $\pi_k^* = \max_i \{\pi_{ik}\}$ be the largest probability of being
alive at time $k$. The theorem states that any myopic policy
is optimal, a myopic policy being any policy that always
chooses a target $i$ for which $\pi_{ik} = \pi_k^*$ at all times $k \ge 0$.
Our proof is by induction on $s$, the number of shots. The
proof is trivial if $s = 1$. The problem is to show that there is
some optimal policy that makes the myopic choice at $k = 0$
because it can be assumed by the induction principle that
the myopic choice is optimal for $k > 0$. Suppose that some
optimal policy P first chooses target 2, where $\pi_{20} < \pi_0^*$. We
can assume that P proceeds myopically for $k > 0$, because
myopic continuations are optimal by induction. We will
show that there is a different policy Q that first chooses
myopically and that is at least as good as P.

Because P is myopic for $k > 0$, it will shoot at target 2
until $\pi_{2K} < \pi_0^*$ at some random stopping time $K$,
or until shots are exhausted, whichever comes first. Let a
sequence of observations $\sigma = (y_1, \ldots, y_k)$ be called “critical”
if observation of that sequence of reports is included
in the event $\{K = k\}$, let $\Sigma(k)$ be the set of all critical
sequences of length $k$, and let $\Sigma = \bigcup_{k \ge 1} \Sigma(k)$. P will
switch from target 2 to some other target immediately after
some critical sequence in $\Sigma$ is observed, unless shots are
exhausted first.

Suppose that P switches from target 2 to target 1 after
time $K$, and note that target 1 is necessarily a myopic target
at time 0 as well as time $K$, because target 1’s probability
of being alive is $\pi_0^*$ at both times. Suppose further
that P next switches from target 1 to some other target
(possibly target 2) after some observation sequence
$\sigma = (y_1, \ldots, y_k)$ associated with target 1 is observed. Then it
must be true that $T(\pi_{10}, \sigma) < \pi_0^*$. Because $\pi_{10} \ge \pi_{20}$, and
because $T(\pi, \sigma)$ is a nondecreasing function of $\pi$, it is
also true that $T(\pi_{20}, \sigma) < \pi_0^*$; that is, either $\sigma$ or some
subsequence of $\sigma$ must be critical. In other words, for any
observation history for which P continues to shoot at target 2,
P will also continue to shoot at target 1, given the
same reports about target 1. In fact, P can be described
as the policy that first shoots at target 2 according to $\Sigma$,
then shoots at target 1 according to $\Sigma$, and then proceeds
myopically (possibly by continuing to shoot at target 1)
as long as shots remain. We refer to the first two parts of
P as being $\Sigma$-controlled. Let Q be the policy that simply
switches targets 1 and 2 in this description. We show that
Q is at least as good as P, thus proving the theorem.

In comparing P and Q, we need only compare the
chances of killing targets 1 and 2 under the $\Sigma$-controlled
phase. This is because the policies differ only in the order
in which targets 1 and 2 are engaged, so the distribution of
the system state at the end of $\Sigma$-controlled shooting, including
the possibility that all shots are exhausted, is the same
in either case.

We now consider the four possibilities for whether targets
1 and 2 are actually alive at time 0. If both are
dead, clearly P and Q will be equally effective under
$\Sigma$-controlled shooting. The same is true if both are initially
alive, because the same stopping rule is applied to
both targets, so consider the case where only one target is
alive. Let $D(t)$ be the probability of killing a target that is
shot at until shooting is stopped by either $\Sigma$ or the horizon
$t$, whichever comes first, given that the target is initially
live, and note that $D(t)$ is a nondecreasing function of $t$.
If the live target is engaged first, then the probability of
killing it under $\Sigma$-controlled fire is $D(s)$. If it is engaged
second, then let $X$ be the number of shots used (wasted)
by first shooting at the one that is dead. The probability
of killing the live target is then $D^* = E(D(s - X))$, which
cannot exceed $D(s)$ on account of the monotonic nature of
$D(t)$. Considering only targets 1 and 2 under $\Sigma$-controlled
fire, the average number of targets killed by policy P is,
therefore, $\pi_{20}(1 - \pi_{10})D(s) + \pi_{10}(1 - \pi_{20})D^*$, with a similar
expression with targets 1 and 2 reversed for policy Q.
Because $\pi_{20} \le \pi_{10}$ and $D^* \le D(s)$, it is trivial to show that
policy Q will kill at least as many targets as policy P.
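
The comparison described as trivial can be written out in one line. Here $N_P$ and $N_Q$ are labels assumed for this sketch (they do not appear in the original) for the average numbers of targets killed by P and Q in the cases where exactly one of targets 1 and 2 is alive.

```latex
N_Q - N_P
  = \bigl[\pi_{10}(1-\pi_{20}) - \pi_{20}(1-\pi_{10})\bigr]\,\bigl(D(s) - D^*\bigr)
  = (\pi_{10} - \pi_{20})\,\bigl(D(s) - D^*\bigr) \;\ge\; 0 .
```

Both factors are nonnegative, the first because target 1 is the myopic choice and the second because $D(t)$ is nondecreasing.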

Because  Policy  Q  has  been  shown  to  be  at  least  as  good 
as  policy  P,  and  because  policy  Q  begins  myopically,  this 
completes  the  inductive  proof  that  all  myopic  policies  are 
optimal.  □ 